Azure Synapse SQL Pool Query Hint

 

Query hints are a great way to enhance query performance and force the optimizer to perform certain operations that an engineer may deem to be more optimal. The example below will show you how a query that can potentially take over an hour to run will run around 15 minutes by leveraging a query hint. Enforcing the importance of query hints in a Data Engineers repertoire.  

 

Test Case:

In this test case, we have the below query that takes over an hour to execute on a DW100c instance. We are leveraging a large resource class to execute the query as well. Result set cache is disabled from the session level to ensure cache data is not being used.



------------------------------------------Start----------------------------------------------
SET RESULT_SET_CACHING OFF
SELECT
   
Coalesce(p.[ModelName], p.[EnglishProductName]) AS [Model]
    ,g.City
AS ResellerCity
    ,g.StateProvinceName
AS StateProvince
    ,
Year(f.OrderDate) AS CalendarYear
    ,
CASE
       
WHEN Month(f.OrderDate) < 7 THEN Year(f.OrderDate)
       
ELSE Year(f.OrderDate) + 1
   
END AS FiscalYear -- Fiscal year runs from Jul to June)
    ,
Month(f.OrderDate) AS [Month]
    ,
Sum(f.OrderQuantity) AS Quantity
    ,
Sum(f.ExtendedAmount) AS Amount
    ,Approx_count_distinct(f.SalesOrderNumber)
AS UniqueOrders 
FROM
    [dbo].[FactResellerSales_HASH_CCI] f
INNER JOIN [dbo].[DimReseller] r
   
ON f.ResellerKey = r.ResellerKey
INNER JOIN [dbo].[DimGeography] g
   
ON r.GeographyKey = g.GeographyKey
INNER JOIN [dbo].[DimProduct] p
   
ON f.[ProductKey] = p.[ProductKey]
GROUP BY
   
Coalesce(p.[ModelName], p.[EnglishProductName])
    ,g.City
    ,g.StateProvinceName
    ,
Year(f.OrderDate)
    ,
CASE
       
WHEN Month(f.OrderDate) < 7 THEN Year(f.OrderDate)
       
ELSE Year(f.OrderDate) + 1
   
END
    ,
Month(f.OrderDate)
ORDER BY Amount DESC
------------------------------------------End----------------------------------------------

 

 

We first purposely neglect to create any user defined statistics and disabled the auto creation of system statistics on the database to also reinforce the importance of statistics on the data warehouse system. Within 25 minutes of executing, we received a “Msg 1105, Level 16, State 2, Line 31” indicating that TempDB could not allocate any additional space.

 

We proceed to create statistics with a sample size of 50% to confirm if we can at least get the query to run and not fail.

------------------------------------------Start----------------------------------------------
CREATE STATISTICS [FactResellerSales_HASH_CCI_ord_nq]
ON [dbo].[FactResellerSales_HASH_CCI]([OrderDate],OrderQuantity,ExtendedAmount  ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [FactResellerSales_HASH_CCI_prodk]
ON [dbo].[FactResellerSales_HASH_CCI]([ProductKey] ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [FactResellerSales_HASH_CCI_resek]
ON [dbo].[FactResellerSales_HASH_CCI]([ResellerKey] ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimReseller_resek_stat]
ON [dbo].[DimReseller]([ResellerKey] ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimReseller_geogr_stat]
ON [dbo].[DimReseller]([GeographyKey] ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimGeography_geogr_stat]
ON [dbo].[DimGeography](GeographyKey ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimGeography_city_stat]
ON [dbo].[DimGeography](City ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimGeography_engprod_stat]
ON [dbo].[DimGeography](EnglishProductName ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimProduct_prodkey_stat]
ON [dbo].[DimProduct]([ProductKey] ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimProduct_prodname_stat]
ON [dbo].[DimProduct]([EnglishProductName] ) WITH SAMPLE 50 PERCENT;

CREATE STATISTICS [DimProduct_ModelName_stat]
ON [dbo].[DimProduct]([ModelName] ) WITH SAMPLE 50 PERCENT;
------------------------------------------End----------------------------------------------

 

 

 

While the query is able to run, we are unable to get it to run within an hour before our ETL loading job need to remove sessions and run.

As you can see below, there are three expensive sort operations which are taking predominately most of the cost of the query.

 

Chart

Description automatically generated

 

We proceed to adjust the query and add the HASH GROUP hint to force the optimizer to perform a hash aggregate with the grouping data. We now see instead of three sort operations, the data is only sorted at the end and only once. In addition, the query only takes 15 minutes to complete. Which is a huge performance improvement for our runtimes, which fits into the ETL window.

SELECT
   
Coalesce(p.[ModelName], p.[EnglishProductName]) AS [Model]
    ,g.City
AS ResellerCity
    ,g.StateProvinceName
AS StateProvince
    ,
Year(f.OrderDate) AS CalendarYear
    ,
CASE
       
WHEN Month(f.OrderDate) < 7 THEN Year(f.OrderDate)
       
ELSE Year(f.OrderDate) + 1
   
END AS FiscalYear -- Fiscal year runs from Jul to June)
    ,
Month(f.OrderDate) AS [Month]
    ,
Sum(f.OrderQuantity) AS Quantity
    ,
Sum(f.ExtendedAmount) AS Amount
    ,Approx_count_distinct(f.SalesOrderNumber)
AS UniqueOrders 
FROM
    [dbo].[FactResellerSales_HASH_CCI] f
INNER JOIN [dbo].[DimReseller] r
   
ON f.ResellerKey = r.ResellerKey
INNER JOIN [dbo].[DimGeography] g
   
ON r.GeographyKey = g.GeographyKey
INNER JOIN [dbo].[DimProduct] p
   
ON f.[ProductKey] = p.[ProductKey]
GROUP BY
   
Coalesce(p.[ModelName], p.[EnglishProductName])
    ,g.City
    ,g.StateProvinceName
    ,
Year(f.OrderDate)
    ,
CASE
       
WHEN Month(f.OrderDate) < 7 THEN Year(f.OrderDate)
       
ELSE Year(f.OrderDate) + 1
   
END
    ,
Month(f.OrderDate)
ORDER BY Amount DESC
OPTION(HASH GROUP)

 

Chart

Description automatically generated with medium confidence

 

Conclusion:

As we have seen above, query hints play an integral role in truly enhancing workloads. We went from over an hour runtime, to around 15 minutes. It’s important for Data Engineers to exhaust all optimization options to ensure a performant running query in a data warehouse.

Rule of thumb, for group by and order by queries, confirm if OPTION(HASH GROUP) would allow for further optimization. For statements with several join operations, using smaller dimension tables as the leading table in the join, confirm if OPTION(FORCE ORDER) would enhance the query run times. For a list of additional query hints, please visit: OPTION Clause (Transact-SQL) - SQL Server | Microsoft Docs

 

DISCLAIMER: Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment unless thorough testing has been conducted by the app and database teams. THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE. We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that. You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution or use of the Sample Code.